{
 "metadata": {
  "name": "",
  "signature": "sha256:7af2605633653b0e5a2f3403e88da42a9a707cf65e5dac094d5a283d6ece5ab4"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Solving <strike>all</strike><sup><s>some</s></sup><sup><sup>several</sup></sup> social data problems with graphs"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Tools:"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "The igraph-python library:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Python wrapper for C/C++ graph libraries\n",
      "# docs at: http://igraph.org/python/doc/python-igraph.pdf\n",
      "# also exists for C++ and R\n",
      "import igraph\n",
      "# Plotting library for igraph graphs\n",
      "# Note: you'll need cairo (not a python package) to install this, and the install pkg is \"py2cairo\"\n",
      "# cairo and py2cairo are NOT pip installable, so have fun\n",
      "import cairo"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Plotting:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another open-source option for working with graphs in python in Networkx. I find networkx to be slightly easier, and igraph-python to be much, much faster for computationally intensive algorithms. Also, igraph has more built-in algorithms for handling graphs. For that reason, we're mostly going to use igraph."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "The CollectorUtils/network library:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Load pre-generated, pickled user graphs\n",
      "import pickle"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# A helper function we use later\n",
      "def histogram(values):\n",
      "    hist = {}\n",
      "    for v in values:\n",
      "        if v in hist.keys():\n",
      "            hist[v] += 1\n",
      "        else:\n",
      "            hist[v] = 1\n",
      "    return hist"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These are the two python packages that I wrote to turn social data into igraph objects. \"conversation_builder_igraph\" is simply Jeff's \"conversation_builder\" for igraph (rather than Networkx), and it produces exactly the same graph that \"conversation_builder\" produces for any publisher. \"build_user_graph\" produces a graph with 1 edge for 1 activity (as long as the activity is an interaction between users). The resulting igraph has 1 node/user, and 1 edge/activity. It does support multiple edges/ self loops and all of the edges are directed. I'm still addding publishers to \"build_user_graph,\" but right now it works for Disqus data.   \n",
      "\n",
      "More information on how to use these packages for your own data is on datarama in Disqus Tools Overview here:  \n",
      "https://sites.google.com/a/twitter.com/gnip-data-rama/2014-q2---data-blog/disqus-tools-overview"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Using igraph:"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Important manipulation tools:"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Building a graph:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Say we are creating a graph from scratch, G\n",
      "G = igraph.Graph(directed = True)\n",
      "G.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you know Networkx, you know that in Networkx a graph is a dict-of-dicts-of-dicts. In other words, order of node-insertion does not matter, nodes have hashable names (but not orders) and nodes (dictionary keys) are a set. In igraph, nodes are a list of Vertex objects, and while they can be assigned names which you can call them by, they are only necessarily uniquely identified by thier index in that list. Vertices must be inserted into the list (that makes up the graph) in order, and if you try to insert a vertex with the same name/attributes twice, you will get two vertices."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# And we want to add some vertices:\n",
      "# Add n vertices to the graph\n",
      "G.add_vertices(n = 1) # G now has 1 node\n",
      "# or simply call \"add_vertex\" to add 1 vertex\n",
      "G.add_vertex() # G now has 2 nodes\n",
      "# or add a list of vertices by name \n",
      "G.add_vertices([\"Brian\", \"Jeff\"]) # G now has 4 nodes\n",
      "# or add one vertex by name\n",
      "G.add_vertex(\"Fiona\") # G now has 5 nodes\n",
      "G.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Vertex objects are stored in VertexSeq objects, which are at graph.vs\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Edge objects are stored in EdgeSeq objects, whic are at graph.es"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# G.vs = the VertexSeq object of the graph G\n",
      "# We can modify the attributes of the vertices like this:\n",
      "G.vs.select(0,1)['name'] = [\"Dr. Skippy\", \"Josh\"]\n",
      "# Or add whole new sets of attributes to the vertices:\n",
      "G.vs['favorite_dinosaur'] = [\"Velociraptor\",\"Argintinosaurus\",\"Buddy\",\"Triceritops\",\"Pterodactyl\"]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# And we can acces those lists of attributes this way:\n",
      "G.vs['favorite_dinosaur']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "G.vs['favorite_dinosaur']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let's add some edges to our graph. Edges can be added by referencing the node by index or by the 'name' attribute, which is treated specially. Note: You can ONLY add edges to existing nodes. If you try to add an edge connecting to a non-existant node, it will raise a ValueError. Not great."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Edges can be added one at a time:\n",
      "# (and ONLY while adding edges one at a time can you specify attributes in the same call)\n",
      "G.add_edge(source = 'Dr. Skippy', target = 'Josh', attributes = {'dinosaur_interactions': 1})\n",
      "# A list at a time (note it's add_edges() now):\n",
      "G.add_edges([('Brian','Fiona'),('Fiona','Brian'),('Dr. Skippy', 'Jeff')])\n",
      "# Referencing names OR vertex indices\n",
      "G.add_edges([(3,0),(3,2),(0,4),(3,4),(3,1)])\n",
      "# Just as notably, multiple edges are allowed:\n",
      "G.add_edges([(1,3),(1,3),(1,3)])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "G.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Edge objects are stored in EdgeSeq objects, which are at graph.es"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# G.es = the EdgeSeq object of the graph G\n",
      "# We can add edge attributes the exact same way we added vertex attributes\n",
      "G.es['dinosaur_interactions'] = [1,2,3,4,5,6,7,8,9,10,11,12]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The vertex attribute 'label' is treated specially by plot--it appears as a label\n",
      "G.vs['label'] = G.vs['name']\n",
      "# The optional \"layout\" arguement specifies how place the nodes\n",
      "# The Graph method .layout() claculates node placement for the graph based on one of a number of algorithms\n",
      "igraph.plot(G, bbox = (400,400), layout = G.layout('kk'), edge_width = [x for x in G.es['dinosaur_interactions']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Manipulating a graph:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The igraph packages provides many, if not most, classic graph-analysis algorithms, but a lot of the time the real social graph will be much more complicated than an analysis method can deal with. Many algorithms break when presented with multiple edges or directed graphs, and we have to simplify the graph before performing analysis. I've found that the best method is to build the most complicated graph I can store, then simplify it in a way that preserves the information I need for a given analysis."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Graph.simplify()"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Graph.simplify() modifies the graph in place, so first use H = G.copy() to create a graph that we can mess with without loosing G. The first thing that most analysis tools can't deal with is multiple edges."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "H = G.copy()\n",
      "H.simplify(multiple = True, combine_edges = {'dinosaur_interactions' : sum})\n",
      "igraph.plot(H,  bbox = (350,350), layout = H.layout('kk'), edge_width = [x for x in H.es['dinosaur_interactions']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Checkout how .simplify() combined our edge attributes\n",
      "H.es['dinosaur_interactions']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Graph.to_undirected()"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Graph.to_undirected() works much the same way as .simplify(). It has three defualt modes: \"collapse\" (only a single edge should\n",
      "be created from multiple directed edges), \"each\" (every edge will be kept, but no arrowheads), \"mutual\" (one\n",
      "undirected edge for each mutual directed edge pair). Exactly like .simplify(), it then takes an arguement (combine_edges) to specify how to combine edges. combine_edges is a dictionary where each key is an edge attribute and each value is the corresponding function to specify how to combine the values. There are a few default functions (sum, product, mean, median, min, max etc), but one very useful feature of both .to_undirected() and .simplify() is that you can define your own function to result."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def combine_dinosaur_interactions(*args):\n",
      "    d = args[0][0] - args[0][1]\n",
      "    if d < 0:\n",
      "        d = d*(-1)\n",
      "    return d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "H = G.copy()\n",
      "H.to_undirected(mode = \"mutual\", combine_edges = {'dinosaur_interactions': combine_dinosaur_interactions})\n",
      "H.es['dinosaur_interactions']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(H, bbox = (200,200), edge_width = [x for x in G.es['dinosaur_interactions']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "A few classic ideas:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And how to use them in igraph"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Grab a classic social network graph fom http://nexus.igraph.org/\n",
      "S = igraph.Nexus.get(\"UKfaculty\")\n",
      "print igraph.Nexus.info(\"UKfaculty\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The graph summary:\n",
      "# We can see graph attributes \"(g)\", vertex attributes \"(v)\", and edge attributes \"(e)\"\n",
      "# As well as \"D\" -> dicrected\n",
      "#            \"W\" -> weighted\n",
      "# And 81 (nodes) 817 (edges)\n",
      "S.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# And a big, flashy, wonderful-but-messy plot\n",
      "igraph.plot(S, layout = S.layout(\"kk\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Connectedness"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Social network graphs often demonstrate what is called a <i> small-world </i> structure, meaning that while most users are not directly connected to one another they are usually within a few steps of eachother. In a proper small-world network, the average path length is $L \\propto log(N)$.\n",
      "\n",
      "Properties of these kinds of real social networks are:\n",
      "* A \"giant\" connected component (the vast majority of the graph belongs to one connected component)\n",
      "* Short path lengths between people in the giant component\n",
      "* Short average path length in the giant component\n",
      "\n",
      "Reading: <i> Collective dynamics of small-world networks</i> by S. Strogatz and D. Watts"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Returns a VertexClustering object, which is a list-like structure of lists of node indices\n",
      "# mode = 'STRONG' says only mutually connected nodes (in a directed graph) are directed\n",
      "# mode = 'WEAK' treats connectivity as un-directed\n",
      "S_clusters = S.clusters(mode = 'STRONG')\n",
      "cluster_sizes = sorted([len(cluster) for cluster in S_clusters], reverse = True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# A list of the sizes of the clusters. One giant component + a few disconnected nodes\n",
      "cluster_sizes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# We see that there is a giant component, and if that is the only subgraph we care about\n",
      "S_giant = S_clusters.giant() # gives us the giant component as its own graph."
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We said that another important measure of small-world-ness is a relatively short distance between nodes. One way to measure this is the eccentricity of each node.\n",
      "From the igraph documentation: \n",
      "\"The eccentricity of a vertex is calculated by measuring the shortest distance from (or to) the vertex, to (or from) all other vertices in the graph, and taking the maximum.\"\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# returns a list of eccentricities, in the same order as S_giant.vs\n",
      "eccentricities = S_giant.eccentricity()\n",
      "eccentricity_avg = sum(eccentricities)/len(eccentricities)\n",
      "eccentricity_max = max(eccentricities)\n",
      "print(\"Average eccentricity: {}\".format(eccentricity_avg))\n",
      "print(\"Maximum eccentricity: {}\".format(eccentricity_max))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ecc = histogram(eccentricities).items()\n",
      "plt.bar([x[0] for x in ecc], [x[1] for x in ecc])\n",
      "plt.xlabel(\"Eccentricity values\")\n",
      "plt.ylabel(\"Frequency\")\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another way to measure this is the average path length (between all pairs of nodes):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "S_giant.average_path_length()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Degree"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One obvious network measure is degree--the number of nearest neighbors that a vertex has. On a directed graph, degree can be meansured as the total number of edges connected to the vertex $k$, the total number of edges coming from the vertex $k_{out}$, or the total number of edges going to a vertex $k_{in}$. Another degree measure that is often useful is $k_{nn}$, or the average degree $k$ of all of the neighbors of some vertex $n$, and $k_{nn}(k)$, the average nearest-neighbor degree of all nodes with degree $k$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k_distribution = histogram(S.degree()).items()\n",
      "kin_distribution = histogram(S.indegree()).items()\n",
      "kout_distribution = histogram(S.outdegree()).items()\n",
      "# This call retruns a 2-tuple with two lists, [0] is simple k_nn, while [1] is k_nn(k)\n",
      "knn = S.knn() \n",
      "knn_func = [(i+1, knn[1][i]) for i in range(0,len(knn[1])) if knn[1][i] > 0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (14,5))\n",
      "ax0 = axes[0]\n",
      "ax1 = axes[1]\n",
      "\n",
      "width = .35\n",
      "ax0.bar([x[0] for x in kin_distribution], [x[1] for x in kin_distribution], width, color = 'r')\n",
      "ax0.bar([x[0]+width for x in kout_distribution], [x[1] for x in kout_distribution], width, color = 'b')\n",
      "ax0.legend((\"$k_{in}$\",\"$k_{out}$\"))\n",
      "ax0.set_xlabel(\"Degree (k)\")\n",
      "ax0.set_ylabel(\"Number of nodes with degree k\")\n",
      "\n",
      "ax1.plot([x[0] for x in knn_func],[x[1] for x in knn_func], 'b.')\n",
      "ax1.set_xlabel(\"Node degree (k = $ k_{in} + k_{out}$)\")\n",
      "ax1.set_ylabel(\"Average nearest-neighbor degree ($k_{nn}$) for nodes with degree k\")\n",
      "\n",
      "fig.tight_layout()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Betweenness"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Degree (i.e., how many followers a person has on Twitter) is an easy estimate for the influence, importance, Klout etc of a user, but it can be misleading. Imagine an example where all of my followers also follow eachother. In that community I'm not necessarily an important link between people even if I have quite a few followers. <i> However, </i> if I am a single link between many disjoint people, it follows that I have more control over the flow of information. In those two situations I have the same <i> degree </i> but a very different measure of betweenness.   \n",
      "A great cannonical example (which we probably don't have time to go into) is the Padget Florentine Families graph (find it on Nexus.igraph.org), which demonstrates that while the Medicis (the most powerful and well known family at the time) were neither the richest nor the most widely connected, they were an important link between many other powerful families, and so rose to great imporance."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Betweenness centrality: how important a vertex is to the connectivity of its neighbors, i.e., what fraction of shortest paths on the graph ($\\sigma$) pass through (include) the vertex $v$ ($\\sigma(v)$)\n",
      "The betweenness centrality of a vertex $v$ is:\n",
      "$$ g(v) = \\sum_{s \\neq v \\neq t} \\frac{\\sigma_{st}(v)}{\\sigma_{st}} $$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "centrality = S.betweenness()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Similarly, edge betweenness: how important an <i> edge </i> is to the connectivity of the vertices in the network, i.e., what fraction of shortest paths on the graph from some vertex $s$ to another vetex $t$ ($\\sigma(s,t)$) include the edge $e$ ($\\sigma(s,t \\,| \\, e)$)\n",
      "The betweenness centrality of an egde $e$:\n",
      "$$ g(e) = \\sum_{s, t} \\frac{\\sigma(s,t \\, | \\, e)}{\\sigma(s,t)} $$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "edge_betweenness = S.edge_betweenness()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Graph, nodes sized by betweenness\n",
      "igraph.plot(S, layout = S.layout(\"kk\"), vertex_size = [x**(.5) for x in centrality])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Assortativity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The assortativity of nodes is a simple correlation coefficient related to how likely it is for an edge to connect two nodes with some attribute in common vs two edges with some attribute that is different. An assortativity coefficient ($r$) of 0 means that there is effectively no correlation, $0 < r < 1$ means that the graph is assortative for some attribute, and an assortativity coefficient $-1 < r < 0$ means tha the graph is disassortative for some attribute."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Define $e_{ij}$ to be the fraction of edges in the connecting a vertex of type $i$ to type $j$, then the assortativity coefficient is defined as:\n",
      "$$ r =  \\frac{\\sum_i e_{ii} - \\sum_{i,j} e_{ij} e_{ji}}{1 - \\sum_{i,j} e_{ij} e_{ji}} $$\n",
      "\n",
      "Assortativity is easily expressed in terms of a mixing matrix $\\mathbf{e}$, where the $i^{th},j^{th}$ entry of $\\mathbf{e}$ is the fraction of edges from a type $i$ node to a type $j$ node. Then $r$ can be rewritten in terms of the mixing matrix as:\n",
      "$$ \\frac{\\text{Tr}\\mathbf{e} - ||\\mathbf{e}||}{1 - ||\\mathbf{e}^2||} $$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From:   \n",
      "<i> Mixing patterns in networks </i> by M.E.J. Newman  \n",
      "<i> Assortative mixing in networks </i> by M.E.J. Newman"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Calculate the assortativity coefficient for the faculty members based on their academic group (a vertex attribute)\n",
      "S.assortativity('Group')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The social network is highly assortative for academic group, meaning members of the same group are more likely to associate than members of separate groups."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "group_to_color = {1.0: 'red', 2.0: 'blue', 3.0: 'yellow', 4.0: 'green'}\n",
      "igraph.plot(S, layout = S.layout(\"kk\"), vertex_color = [group_to_color[group] for group in S.vs['Group']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can also use another VertexClustering object to draw a graph of how these four groups are connected:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "VC_of_S_obj = igraph.VertexClustering(S)\n",
      "VC_S_by_Group = VC_of_S_obj.FromAttribute(S, attribute = 'Group')\n",
      "S_by_Group = VC_S_by_Group.cluster_graph(combine_vertices = {'Group': 'first'})\n",
      "S_by_Group.vs['size']  = [len(x) for x in VC_S_by_Group]\n",
      "igraph.plot(S_by_Group, layout = S_by_Group.layout(\"kk\"), bbox = (200,200), \n",
      "            vertex_color = [group_to_color[group] for group in S_by_Group.vs['Group']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Degree assortativity is a special case of assortativty, and measures the likelihood of connections between nodes with a similar degree."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "S.assortativity_degree()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The network is non-assortative for degree."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Similarity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Temporal & neighborhood similarity (correlation) across nodes. How many neighbors two nodes share (neighborhood overlap) or how much of my neighborhood persists from one time-window to the next). Good reading on measuring temporal correlation (similarity) in <i> Graph Metrics for Temporal Networks </i> by Nicosia et al."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Communities"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A connected graph is not necessarily homogenously connected. We saw <i> assortativity </i> for what types of vertices like to connect to eachother, but what about groups that can't be defined by known vertex attributes? In social networks, some groups of users are more tightly connected than others--we see groups of tightly connected users (\"communities\") that have a few ties going to the world around them. This idea is very important for identifying target groups, finding commonalities, and even spotting bots (check this out from our friends at Sysmos: https://twitter.com/edkim_mw/status/476731214276329472/). Ideas about communities can be usefully extended to other types of networks, like topic identification: think of words as vertices, and their appearance in the same text as an edge. Words that appear together most commonly will form communities--topics.\n",
      "\n",
      "The basic premise for identifying a network comminuty is identifying groups of users that are more tightly connected than others. There are several available algorithms (of varying efficacy, computational cost, and complexity), I'm going to focus on four.  \n",
      "1. Edge betweenness community detection  \n",
      " - <i> Community structure in social and biological networks </i> by  M. Girvan and M.E.J. Newman\n",
      " - Progressively remove the edges in the graph with the highest betweenness, effectively cutting communities off from one another\n",
      " - Very expensive to compute, impractical for large networks\n",
      " - igraph implementation: Graph.community_edge_betweenness()  \n",
      " - Works for weighted, directed graphs\n",
      "2. Modulatity maximization\n",
      " - <i> Finding community structure in very large networks </i> by A. Clauset, M.E.J. Newman, C. Moore\n",
      " - Begin with every vertex in its own community, then greedily maximize <i> modularity</i>, a measure of in-group vs out-group connections by agglomerating existing clusters.\n",
      " - Practical for larger networks.\n",
      " - Has serious problems with resolution--it cannot resolve communties below a certain size, and often mistakenly groups together disjoint communities. A good review of why modularity-maximization based approaches are flawed in this way is presented in <i> Resolution limit in community detection </i> by S. Fortunato and M. Barth\u00e9lemy.\n",
      "  - igraph implementation: Graph.community_fastgreedy()  \n",
      "  - Works for weighted, un-directed graphs\n",
      "3. Random walk\n",
      " - <i> Computing communities in large networks using random walks </i> by P. Pons and M. Latapy  \n",
      " - Based on the idea that a random walker following graph edges will spend a disproportionate amount of time in a community\n",
      " - Slow, but not impossibly so, for larger networks\n",
      " - I have'nt tdone as much reading about this algorithm as the others, so I am not aware of other pros/ cons.\n",
      " - igraph implementation: Graph.community_walktrap()  \n",
      " - Works for weighted, directed graphs\n",
      "4. Probabalistic community groupings\n",
      " - <i> A Bayesian Approach to Network Modularity </i> by J. Hoffman and C. Wiggins, source code at vbmod.sf.net\n",
      " - Fast, no resolution limit\n",
      " - Not implemented in igraph, but source is supplied for python and I bet we could get it working pretty seamlessly with igraph--but I need some time/ help figuring out what the heck is going on with the code.\n",
      " \n",
      "\n",
      "A quick(ish) review of much of the available literature & code on the subject of community detection can be found in <i> Community detection in graphs </i> by S. Fortunato.\n",
      "\n",
      "\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "1. Community edge betweenness"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "S_dendrogram_eb = S.community_edge_betweenness(weights = 'weight')\n",
      "S_communities_eb = S_dendrogram_eb.as_clustering()\n",
      "igraph.plot(S_dendrogram_eb)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(S_communities_eb, layout = S.layout(\"kk\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "2. Fast-greedy modularity maximization"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# This algorithm only takes undirected graphs, so we get to use some graph manipulation tools from earlier\n",
      "S_undirected = S.copy()\n",
      "S_undirected.to_undirected(mode = \"mutual\", combine_edges = {'weight': sum})\n",
      "S_dendrogram_fg = S_undirected.community_fastgreedy(weights = 'weight')\n",
      "S_communities_fg = S_dendrogram_fg.as_clustering()\n",
      "igraph.plot(S_dendrogram_fg)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(S_communities_fg, layout = S.layout(\"kk\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "3. Random walk, \"walk-trap\" method"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "S_dendrogram_wt = S.community_walktrap(weights = 'weight')\n",
      "S_communities_wt = S_dendrogram_wt.as_clustering()\n",
      "igraph.plot(S_dendrogram_wt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(S_communities_wt, layout = S.layout(\"kk\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "4. Variational Bayes module detection"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Download the vbmod python distribution on vbmod.sf.net, and figure out how it works."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Real data!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Which we did not use before because it's waaaay slower than the little graph we're working with. I would, however, encourage you to evaluate these cells after we're done and see what can work with real live Disqus social network data. Everything below this runs quite nicely with this large dataset, but even so, be prepared to wait a few minutes for each cell to run on your laptop.  \n",
      "\n",
      "This data is all Disqus activity on ~20 top urls in varying topics (business, news, gossip, trivia, consumer info etc) for Disqus traffic over two days (4-21-2014 and 4-22-2014). Each edge in the graph has a few attributes, including a 'conversation' attribute that tells up on what article url the activity that generated that edge was created."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Load the Disqus social network data from my provided graph file\n",
      "# This was created (over the course of several hours) using build_user_graph from the command line\n",
      "Disqus_network = pickle.load(open(\"output_graphs/test_graph.p\",\"r\"))\n",
      "# Make a copy that we can simplify, for the sake of analysis\n",
      "D = Disqus_network.copy()\n",
      "# Define our own functions to combine edges\n",
      "def concat_conv(*args):\n",
      "    return args[0]\n",
      "def concat_list(*args):\n",
      "    return [x for sublist in args[0] for x in sublist]\n",
      "# Simply the graph--remove multiple edges, remove self loops\n",
      "D.simplify(multiple = True, loops = True, \n",
      "           combine_edges = {'weight':sum, 'conversation': concat_conv, 'record':'ignore', 'time':'ignore'})\n",
      "# Find the giant connected component (for most of out analysis we'll use this piece)\n",
      "# Make note of how many nodes this giant component has compared to the rest of the graph.\n",
      "D_clusters = D.clusters(mode = 'WEAK')\n",
      "D_giant = D_clusters.giant()\n",
      "# And make an undirected copy of the giant component, which we can use for algorothms that need undirected input\n",
      "D_undirected = D_giant.as_undirected(mode = 'collapse', combine_edges = {'weight':sum, 'conversation': concat_list})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Let's look at what we've got:\n",
      "print(\"Disqus_network, the totally un-simplified graph: {} \\n\".format(Disqus_network.summary()))\n",
      "print(\"D, without multi-edges and without self-loops: {} \\n\".format(D.summary()))\n",
      "print(\"D_giant, the giant (strongly connected) component: {} \\n\".format(D_giant.summary()))\n",
      "print(\"D_undirected, the undirected graph: {} \\n\".format(D_undirected.summary()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cluster_sizes = [len(x) for x in D_clusters]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Degree"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k_distribution_D = histogram(D.degree()).items()\n",
      "kin_distribution_D = histogram(D.indegree()).items()\n",
      "kout_distribution_D = histogram(D.outdegree()).items()\n",
      "# This call retruns a 2-tuple with two lists, [0] is simple k_nn, while [1] is k_nn(k)\n",
      "knn_D = D.knn() \n",
      "knn_func_D = [(i+1, knn_D[1][i]) for i in range(0,len(knn_D[1])) if knn_D[1][i] > 0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fig_D, axes_D = plt.subplots(nrows=1, ncols=2, figsize = (14,5))\n",
      "ax0_D = axes_D[0]\n",
      "ax1_D = axes_D[1]\n",
      "\n",
      "width_D = .35\n",
      "ax0_D.bar([x[0] for x in kin_distribution_D[0:30]], [x[1] for x in kin_distribution_D[0:30]], width_D, color = 'r')\n",
      "ax0_D.bar([x[0]+width for x in kout_distribution_D[0:30]], [x[1] for x in kout_distribution_D[0:30]], width_D, color = 'b')\n",
      "ax0_D.legend((\"$k_{in}$\",\"$k_{out}$\"))\n",
      "ax0_D.set_xlabel(\"Degree (k)\")\n",
      "ax0_D.set_ylabel(\"Number of nodes with degree k\")\n",
      "\n",
      "ax1_D.plot([x[0] for x in knn_func_D],[x[1] for x in knn_func_D], 'b.')\n",
      "ax1_D.set_xlabel(\"Node degree (k = $ k_{in} + k_{out}$)\")\n",
      "ax1_D.set_ylabel(\"Average nearest-neighbor degree ($k_{nn}$) for nodes with degree k\")\n",
      "\n",
      "fig_D.tight_layout()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Small-world connectivity and vertex eccentricity"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# returns a list of eccentricities, in the same order as D_giant.vs\n",
      "# If you want to see this, you have to be patient. It takes a few minutes (~ five)\n",
      "eccentricities_D = D_giant.eccentricity(mode = 'ALL') # mode = 'All' means ignoring edge direction\n",
      "eccentricity_avg_D = sum(eccentricities_D)/len(eccentricities_D)\n",
      "eccentricity_max_D = max(eccentricities_D)\n",
      "print(\"Average eccentricity: {}\".format(eccentricity_avg_D))\n",
      "print(\"Maximum eccentricity: {}\".format(eccentricity_max_D))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ecc_D = histogram(eccentricities_D).items()\n",
      "plt.bar([x[0] for x in ecc_D], [x[1] for x in ecc_D])\n",
      "plt.xlabel(\"Eccentricity values\")\n",
      "plt.ylabel(\"Frequency\")\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Community structure in a real social network"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "D_dendrogram_wt = D_giant.community_walktrap(weights = 'weight')\n",
      "D_communities_wt = D_dendrogram_wt.as_clustering(500) \n",
      "# 500 mean that I'm arbitrarily choosing that there should be roughly 500 clusters.\n",
      "# Not a great thing to do, but otherwise we get so many clusters that it's not useful as an example\n",
      "D_dendrogram_wt.optimal_count"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(D_communities_wt.cluster_graph(), vertex_size = [len(x)**(.5) for x in D_communities_fg])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "That looks...spiky."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "D_undirected = D_giant.as_undirected(mode = 'collapse', combine_edges = {'weight':sum, 'conversation': concat_list})\n",
      "D_dendrogram_fg = D_undirected.community_fastgreedy()\n",
      "D_communities_fg = D_dendrogram_fg.as_clustering()\n",
      "D_dendrogram_fg.optimal_count"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Look at how communities are connected\n",
      "igraph.plot(D_communities_fg.cluster_graph(), vertex_size = [len(x)**(.5) for x in D_communities_fg])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So what? We can see that the community finding algorithms give pretty different results, and we know that somebody has to be off. How can we say that they are giving us anything useful?  \n",
      "\n",
      "Well, one thing we can see in that last graph is that we seem to have two very big clusters opposite eachother. Let's see what kind of conversation-edges those clusters contain."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cluster_sizes_D = [len(x) for x in D_communities_fg]\n",
      "sorted_cluster_sizes = sorted(cluster_sizes_D,reverse = True)\n",
      "percentages = [sum(sorted_cluster_sizes[0:x])/float(sum(cluster_sizes_D)) for x in xrange(1,len(cluster_sizes_D))]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Arbitrary cuttoff time. The 9 largest community clusters in D_communities_fg encompass $> 90\\% $ of the vertices in the graph,  let's care about the biggest communities first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Make a list of all of the conversation-edges in each community:\n",
      "# conversations is a list of dicts (where each dict )\n",
      "conversations = [{} for x in range(0,len(D_communities_fg))]\n",
      "for i, cluster in enumerate(D_communities_fg):\n",
      "    sample_subgraph = D_undirected.subgraph(cluster)\n",
      "    edge_c = sample_subgraph.es['conversation']\n",
      "    cs = []\n",
      "    for c in edge_c:\n",
      "        cs.extend(c)\n",
      "    conversations[i] = sorted(histogram(cs).items(), key=lambda x: x[1], reverse = True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conversations[cluster_sizes_D.index(sorted_cluster_sizes[0])]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conversations[cluster_sizes_D.index(sorted_cluster_sizes[1])]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conversations[cluster_sizes_D.index(sorted_cluster_sizes[2])]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "conversations[cluster_sizes_D.index(sorted_cluster_sizes[3])]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}